[My IODS project GitHub data repository](https://github.com/maivu2054/IODS-project)


About the project

This is my course diary for Introduction to Open Data Science 2019. It is an R Markdown (.Rmd) file, so R Markdown syntax is used throughout.

RStudio Exercise 1: Tasks and Instructions

I joined this course to discover the world of data science. Besides, I would like to make friends with people who work in the health sciences. I am currently a PhD student at the University of Eastern Finland.

My repository link

My course diary




Regression and model validation - exercise 2

Data wrangling step

The data were prepared from the original file. The R script is here and the data is here.

Install some packages first:

install.packages(c("dplyr", "GGally", "tidyverse", "xlsx", "ggplot2"))

Question 1: Reading data

library(dplyr)
library(ggplot2)
library(readxl)

# read the wrangled data and convert gender into a factor
learning2014 <- readxl::read_excel("data/learning2014.xlsx") %>%
  mutate_at(vars(gender), factor)
dim(learning2014)
head(learning2014)
str(learning2014)

Question 2: Describe the dataset, give a graphical overview of the data, and summarize the variables in the data

The data include seven variables: gender, age, attitude, deep, stra, surf, and points.

gender: M (Male), F (Female)

age: age (in years) derived from the date of birth

attitude: global attitude toward statistics

deep: average score of the deep learning questions

stra: average score of the strategic learning questions

surf: average score of the surface learning questions

points: exam points

The full data is available at https://www.mv.helsinki.fi/home/kvehkala/JYTmooc/JYTOPKYS3-data.txt

head(learning2014)
## # A tibble: 6 x 7
##   gender   age attitude  deep  stra  surf points
##   <fct>  <dbl>    <dbl> <dbl> <dbl> <dbl>  <dbl>
## 1 F         53      3.7  3.58  3.38  2.58     25
## 2 M         55      3.1  2.92  2.75  3.17     12
## 3 F         49      2.5  3.5   3.62  2.25     24
## 4 M         53      3.5  3.5   3.12  2.25     10
## 5 M         49      3.7  3.67  3.62  2.83     22
## 6 F         38      3.8  4.75  3.62  2.42     21
#describe some variables
as.numeric(learning2014$age)
##   [1] 53 55 49 53 49 38 50 37 37 42 37 34 34 34 35 33 32 44 29 30 27 29 31
##  [24] 37 26 26 30 33 33 28 26 27 25 31 20 39 38 24 26 25 30 25 30 48 24 40
##  [47] 25 23 25 23 27 25 23 23 23 45 22 23 21 21 21 21 26 25 26 21 23 22 22
##  [70] 22 23 22 23 24 22 23 22 20 22 21 22 23 21 22 29 29 21 28 21 30 21 23
##  [93] 21 21 20 22 21 21 20 22 20 20 24 20 19 21 21 22 25 21 22 25 20 24 20
## [116] 21 20 20 31 20 22 22 21 22 21 21 21 21 20 21 25 21 24 20 21 20 20 18
## [139] 21 19 21 20 21 20 21 20 18 22 22 24 19 20 20 17 19 20 20 20 20 19 19
## [162] 22 35 18 19 21
summaryvar <- summary(learning2014)
print(summaryvar)
##  gender       age           attitude          deep            stra      
##  F:110   Min.   :17.00   Min.   :1.400   Min.   :1.583   Min.   :1.250  
##  M: 56   1st Qu.:21.00   1st Qu.:2.600   1st Qu.:3.333   1st Qu.:2.625  
##          Median :22.00   Median :3.200   Median :3.667   Median :3.188  
##          Mean   :25.51   Mean   :3.143   Mean   :3.680   Mean   :3.121  
##          3rd Qu.:27.00   3rd Qu.:3.700   3rd Qu.:4.083   3rd Qu.:3.625  
##          Max.   :55.00   Max.   :5.000   Max.   :4.917   Max.   :5.000  
##       surf           points     
##  Min.   :1.583   Min.   : 7.00  
##  1st Qu.:2.417   1st Qu.:19.00  
##  Median :2.833   Median :23.00  
##  Mean   :2.787   Mean   :22.72  
##  3rd Qu.:3.167   3rd Qu.:27.75  
##  Max.   :4.333   Max.   :33.00
library(dplyr)
library (GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
## 
## Attaching package: 'GGally'
## The following object is masked from 'package:dplyr':
## 
##     nasa
library(ggplot2)
library(tidyverse)
## -- Attaching packages --------------------------- tidyverse 1.2.1 --
## v tibble  2.1.3     v purrr   0.3.2
## v tidyr   0.8.3     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.4.0
## -- Conflicts ------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(psych)
## 
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
# see the correlations between the variables
pairs.panels(learning2014)

Comments on the distributions of the variables and the relationships between them

Each individual plot shows the relationship between the variable on the horizontal axis and the variable on the vertical axis of the grid. Gender is a discrete variable, so we focus on the continuous variables: age, attitude, deep, stra, surf, and points.

For example, the correlation between age and attitude is 0.02. Attitude toward statistics has a positive correlation with age, deep learning and strategic learning, but a negative correlation with surface learning.

The diagonal shows a histogram of each variable, and the plot matrix shows that points and attitude have a relatively strong correlation (r = 0.44).

Question 3

library(GGally)
library(ggplot2)
# create a more advanced plot matrix with ggpairs()
ggpairs(learning2014, 
        mapping = aes(col = gender, alpha = 0.3), 
        lower = list(combo = wrap("facethist", bins = 20))
        )

Choose three variables as explanatory variables and fit a regression model where exam points is the target (dependent) variable.

Test the associations between points and attitude, deep, and stra:

cor.test(learning2014$attitude,learning2014$points)
## 
##  Pearson's product-moment correlation
## 
## data:  learning2014$attitude and learning2014$points
## t = 6.2135, df = 164, p-value = 4.119e-09
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3044463 0.5521335
## sample estimates:
##       cor 
## 0.4365245
cor.test(learning2014$deep,learning2014$points)
## 
##  Pearson's product-moment correlation
## 
## data:  learning2014$deep and learning2014$points
## t = -0.12997, df = 164, p-value = 0.8967
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.1622192  0.1423931
## sample estimates:
##         cor 
## -0.01014849
cor.test(learning2014$stra,learning2014$points)
## 
##  Pearson's product-moment correlation
## 
## data:  learning2014$stra and learning2014$points
## t = 1.8916, df = 164, p-value = 0.06031
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.006340149  0.291945961
## sample estimates:
##       cor 
## 0.1461225

The correlation between attitude and points from cor.test is 0.4365245 (p-value = 4.119e-09).

The correlation between deep and points is -0.01014849 (p-value = 0.8967).

The correlation between stra and points is 0.1461225 (p-value = 0.06031).

Looking at the p-values, the associations of points with deep and with stra are not statistically significant, so the deep and stra variables are removed from the model.

Question 4: Relationship between attitude and points

p1 <- ggplot(learning2014, aes(x = attitude, y = points))
p2 <- p1 + geom_point()
p3 <- p2 + geom_smooth(method = "lm")
p4 <- p3 + ggtitle( "Relation between attitude and points")
print (p4)

A linear model is fitted to the data, where points are explained by attitude.

The equation for the model is \[ Y_i = \alpha + \beta_1 X_i + \epsilon_i \] where \(Y\) represents points, \(X\) is attitude, \(\alpha\) is the intercept, \(\beta_1\) is the regression coefficient for attitude, and \(\epsilon\) is a random error term.

m1 <- lm(points ~ attitude, data = learning2014)
results <- summary(m1)
knitr::kable(results$coefficients, digits = 3, caption = "Regression coefficients")

Regression coefficients

|             | Estimate | Std. Error | t value | Pr(>\|t\|) |
|-------------|----------|------------|---------|------------|
| (Intercept) | 11.637   | 1.830      | 6.358   | 0          |
| attitude    | 3.525    | 0.567      | 6.214   | 0          |

Both the intercept and attitude are statistically significant predictors. The coefficient of determination \(R^2\) = 0.1905537, which is not particularly high: attitude alone explains about 19% of the variation in exam points.
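For reference, the \(R^2\) value can be read directly from the stored model summary (a sketch, assuming m1 and results as computed above):

```r
# coefficient of determination and its adjusted version from the lm summary
results$r.squared
results$adj.r.squared
```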

Question 5: Diagnostic plots.

plot(m1, which=c(1,2,5))





Chapter 3

Q2 Read data and print out the names of the variables in the data

Describe the dataset briefly

This dataset concerns student achievement in secondary education at two Portuguese schools. The data attributes include student grades, demographic, social and school-related features, and it was collected using school reports and questionnaires. The datasets provide information on performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). In [Cortez and Silva, 2008], the two datasets were modeled under binary/five-level classification and regression tasks.

Important note: the target attribute G3 has a strong correlation with attributes G2 and G1. This occurs because G3 is the final year grade (issued in the 3rd period), while G1 and G2 correspond to the 1st and 2nd period grades. It is more difficult to predict G3 without G2 and G1, but such prediction is much more useful (see the source paper for more details). The data (n = 382) has 35 different variables and is a join of the two datasets.

  • The variables include:

school: binary variable: ‘GP’ - Gabriel Pereira or ‘MS’ - Mousinho da Silveira

  • Characteristics of participants: sex, age

  • Demographic and family information: address, famsize, Pstatus, Medu, Fedu

  • Educational information: absences - number of school absences; failures - number of past class failures; nursery - attended nursery school (binary: yes or no); internet - Internet access at home (binary: yes or no); guardian - student’s guardian (nominal: ‘mother’, ‘father’ or ‘other’)

  • G1, G2, G3: grades related to the course subject (Math or Portuguese)

Q3: Choose 4 interesting variables in the data and predict their association with high_use

I want to see how age (age), sex (sex), being in a romantic relationship (romantic), and the number of school absences (absences) affect alcohol consumption.

The hypothesis is that being older, being male, not being in a romantic relationship, and having more days of school absence will all increase alcohol consumption.

Q4: Numerically and graphically explore the distributions of chosen variables

g1 <- ggplot(data = dat, aes(x = high_use, fill = sex))
g1 + geom_bar() 

From the first plot, it can be seen that males account for a higher proportion of the high_use group than females.

g2 <- ggplot(data = dat, aes(x = high_use, fill =  romantic ))
g2 + geom_bar()

Here it can be seen that students who are not in a romantic relationship account for a somewhat larger proportion of high alcohol consumption than those who are.

g3 <- ggplot(data = dat, aes(x = high_use, y= absences))
g3 + geom_boxplot()

More days of absence from school are associated with higher alcohol consumption.

Q5: Logistic regression model

m <- glm(high_use ~ age+ sex + romantic+ absences, data = dat, family = "binomial")
summary(m)
## 
## Call:
## glm(formula = high_use ~ age + sex + romantic + absences, family = "binomial", 
##     data = dat)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7179  -0.8456  -0.6286   1.0823   2.1893  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -4.54258    1.70433  -2.665  0.00769 ** 
## age          0.16915    0.10226   1.654  0.09810 .  
## sexM         0.99159    0.24303   4.080 4.50e-05 ***
## romanticyes -0.29584    0.26585  -1.113  0.26579    
## absences     0.07283    0.01828   3.983 6.79e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 462.21  on 381  degrees of freedom
## Residual deviance: 422.16  on 377  degrees of freedom
## AIC: 432.16
## 
## Number of Fisher Scoring iterations: 4
OR <- coef(m) %>% exp
CI <- confint(m) %>% exp
## Waiting for profiling to be done...
cbind (OR, CI)
##                     OR        2.5 %   97.5 %
## (Intercept) 0.01064589 0.0003570945 0.290015
## age         1.18430022 0.9701859445 1.450229
## sexM        2.69551085 1.6842578252 4.375275
## romanticyes 0.74390627 0.4371001910 1.242806
## absences    1.07554516 1.0398995842 1.116863

Interpreting the results:

  • Age: getting older is associated with increased alcohol use; however, the effect of age is not statistically significant (p-value = 0.10).

  • Being male increases the odds of high alcohol use compared with being female (OR = 2.69); this result is statistically significant.

  • Being in a romantic relationship decreases alcohol consumption; however, the effect of this variable is not statistically significant (p-value = 0.27).

  • Each additional day of absence from school increases the odds of high alcohol use by about 7% (OR = 1.07, 95% CI = 1.04 to 1.12).
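The odds ratios are simply the exponentiated logistic regression coefficients; for example, for sexM:

\[ OR_{sexM} = e^{\hat{\beta}_{sexM}} = e^{0.9916} \approx 2.70 \]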

Q6: Prediction analysis

Provide a 2x2 cross tabulation of predictions versus the actual values and optionally display a graphic visualizing both the actual values and the predictions. Compute the total proportion of inaccurately classified individuals (= the training error) and comment on all the results. Compare the performance of the model with performance achieved by some simple guessing strategy.

probabilities <- predict(m, type = "response")

dat <- mutate(dat, probability = probabilities)

dat <- mutate(dat, prediction = probabilities > 0.5)

# tabulate the target variable versus the predictions

table(high_use = dat$high_use, prediction = dat$prediction)
##         prediction
## high_use FALSE TRUE
##    FALSE   261    9
##    TRUE     88   24

From the table, the training error (the proportion of inaccurately classified individuals) is (9 + 88) / 382 ≈ 0.25. A simple guessing strategy of always predicting FALSE would misclassify 112 / 382 ≈ 0.29 of the individuals, so the model performs somewhat better than guessing.

loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)}
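Before cross-validating, the same loss function can be applied to the training data itself, which reproduces the training error computed from the 2x2 table (a sketch, assuming dat and its probability column as above):

```r
# proportion of misclassified individuals in the training data
loss_func(class = dat$high_use, prob = dat$probability)
```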

library(boot)
## 
## Attaching package: 'boot'
## The following object is masked from 'package:psych':
## 
##     logit
cv <- cv.glm(data = dat, cost = loss_func, glmfit = m, K = nrow(dat))
summary (cv)
##       Length Class  Mode   
## call    5    -none- call   
## K       1    -none- numeric
## delta   2    -none- numeric
## seed  626    -none- numeric
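summary() only lists the components of the cv.glm result; the leave-one-out cross-validation error itself is the first element of delta (a sketch, assuming cv as computed above):

```r
# average proportion of wrong predictions across the n leave-one-out folds
cv$delta[1]
```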





Question 2: Load the Boston data from the MASS package. Explore the structure and dimensions of the data and describe the dataset briefly

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
library (tidyverse)
library (corrplot)
## corrplot 0.84 loaded
library (plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
#Load dataset Boston from MASS
data("Boston")

#look at data Boston
head (Boston)
##      crim zn indus chas   nox    rm  age    dis rad tax ptratio  black
## 1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90
## 2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90
## 3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83
## 4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63
## 5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90
## 6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7 394.12
##   lstat medv
## 1  4.98 24.0
## 2  9.14 21.6
## 3  4.03 34.7
## 4  2.94 33.4
## 5  5.33 36.2
## 6  5.21 28.7
#look at the structure of dataset
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
data(Boston)

Data summary: This dataset describes housing values in the suburbs of Boston. The Boston data frame has 506 rows and 14 columns, with the variables explained below:

crim: per capita crime rate by town.

zn: proportion of residential land zoned for lots over 25,000 sq.ft.

indus: proportion of non-retail business acres per town.

chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).

nox: nitrogen oxides concentration (parts per 10 million).

rm: average number of rooms per dwelling.

age: proportion of owner-occupied units built prior to 1940.

dis: weighted mean of distances to five Boston employment centres.

rad: index of accessibility to radial highways.

tax: full-value property-tax rate per $10,000.

ptratio: pupil-teacher ratio by town.

black: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.

lstat: lower status of the population (percent).

medv: median value of owner-occupied homes in $1000s.

Question 3: A graphical overview of the data and summaries of the variables in the data

##       crim               zn              indus              chas         
##  Min.   :-0.3900   Min.   :-0.5700   Min.   :-0.7100   Min.   :-0.12000  
##  1st Qu.:-0.2150   1st Qu.:-0.4050   1st Qu.:-0.3825   1st Qu.:-0.04750  
##  Median : 0.3200   Median :-0.2550   Median : 0.3950   Median : 0.02000  
##  Mean   : 0.1786   Mean   :-0.0550   Mean   : 0.1929   Mean   : 0.08143  
##  3rd Qu.: 0.4500   3rd Qu.: 0.2775   3rd Qu.: 0.6300   3rd Qu.: 0.09000  
##  Max.   : 1.0000   Max.   : 1.0000   Max.   : 1.0000   Max.   : 1.00000  
##       nox               rm                age               dis         
##  Min.   :-0.770   Min.   :-0.61000   Min.   :-0.7500   Min.   :-0.7700  
##  1st Qu.:-0.360   1st Qu.:-0.29750   1st Qu.:-0.2625   1st Qu.:-0.5225  
##  Median : 0.305   Median :-0.21500   Median : 0.3050   Median :-0.3050  
##  Mean   : 0.190   Mean   :-0.01286   Mean   : 0.1736   Mean   :-0.1464  
##  3rd Qu.: 0.655   3rd Qu.: 0.19000   3rd Qu.: 0.5775   3rd Qu.: 0.2400  
##  Max.   : 1.000   Max.   : 1.00000   Max.   : 1.0000   Max.   : 1.0000  
##       rad               tax             ptratio            black         
##  Min.   :-0.4900   Min.   :-0.5300   Min.   :-0.5100   Min.   :-0.44000  
##  1st Qu.:-0.2850   1st Qu.:-0.3050   1st Qu.:-0.2175   1st Qu.:-0.37750  
##  Median : 0.4600   Median : 0.4850   Median : 0.2250   Median :-0.22500  
##  Mean   : 0.2371   Mean   : 0.2364   Mean   : 0.1157   Mean   :-0.06071  
##  3rd Qu.: 0.6075   3rd Qu.: 0.6475   3rd Qu.: 0.3775   3rd Qu.: 0.16750  
##  Max.   : 1.0000   Max.   : 1.0000   Max.   : 1.0000   Max.   : 1.00000  
##      lstat              medv         
##  Min.   :-0.7400   Min.   :-0.74000  
##  1st Qu.:-0.4000   1st Qu.:-0.46000  
##  Median : 0.4150   Median :-0.38000  
##  Mean   : 0.1407   Mean   :-0.06857  
##  3rd Qu.: 0.5775   3rd Qu.: 0.31000  
##  Max.   : 1.0000   Max.   : 1.00000

The figure shows the distributions of the variables and their relationships. For example, chas is a categorical (dummy) variable.

Question 4: Standardizing the data

# scale the variables
boston_scaled <- scale(Boston)
summary(boston_scaled %>% round(digits = 2))
##       crim                 zn                indus          
##  Min.   :-0.420000   Min.   :-0.490000   Min.   :-1.560000  
##  1st Qu.:-0.410000   1st Qu.:-0.490000   1st Qu.:-0.870000  
##  Median :-0.390000   Median :-0.490000   Median :-0.210000  
##  Mean   :-0.000119   Mean   :-0.002154   Mean   :-0.001304  
##  3rd Qu.: 0.010000   3rd Qu.: 0.050000   3rd Qu.: 1.010000  
##  Max.   : 9.920000   Max.   : 3.800000   Max.   : 2.420000  
##       chas                nox                  rm           
##  Min.   :-0.270000   Min.   :-1.46e+00   Min.   :-3.880000  
##  1st Qu.:-0.270000   1st Qu.:-9.10e-01   1st Qu.:-0.570000  
##  Median :-0.270000   Median :-1.40e-01   Median :-0.110000  
##  Mean   : 0.001838   Mean   : 5.93e-05   Mean   : 0.000138  
##  3rd Qu.:-0.270000   3rd Qu.: 6.00e-01   3rd Qu.: 0.480000  
##  Max.   : 3.660000   Max.   : 2.73e+00   Max.   : 3.550000  
##       age                 dis                 rad           
##  Min.   :-2.330000   Min.   :-1.270000   Min.   :-9.80e-01  
##  1st Qu.:-0.837500   1st Qu.:-0.800000   1st Qu.:-6.40e-01  
##  Median : 0.315000   Median :-0.280000   Median :-5.20e-01  
##  Mean   : 0.000415   Mean   :-0.000138   Mean   : 5.93e-05  
##  3rd Qu.: 0.907500   3rd Qu.: 0.660000   3rd Qu.: 1.66e+00  
##  Max.   : 1.120000   Max.   : 3.960000   Max.   : 1.66e+00  
##       tax                ptratio             black          
##  Min.   :-1.3100000   Min.   :-2.70000   Min.   :-3.900000  
##  1st Qu.:-0.7700000   1st Qu.:-0.49000   1st Qu.: 0.202500  
##  Median :-0.4600000   Median : 0.27500   Median : 0.380000  
##  Mean   : 0.0001383   Mean   : 0.00164   Mean   :-0.000079  
##  3rd Qu.: 1.5300000   3rd Qu.: 0.81000   3rd Qu.: 0.430000  
##  Max.   : 1.8000000   Max.   : 1.64000   Max.   : 0.440000  
##      lstat               medv           
##  Min.   :-1.53000   Min.   :-1.9100000  
##  1st Qu.:-0.79750   1st Qu.:-0.5975000  
##  Median :-0.18000   Median :-0.1400000  
##  Mean   : 0.00002   Mean   : 0.0000988  
##  3rd Qu.: 0.60000   3rd Qu.: 0.2700000  
##  Max.   : 3.55000   Max.   : 2.9900000
#to see class of the scaled object
class (boston_scaled)
## [1] "matrix"
#change boston_scaled from matrix into data frame
boston_scaled <-as.data.frame(boston_scaled)
summary(Boston$crim)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00632  0.08204  0.25651  3.61352  3.67708 88.97620
# create a quantile vector of crim
bins <- quantile(boston_scaled$crim)
#see quantile parts of bins
bins
##           0%          25%          50%          75%         100% 
## -0.419366929 -0.410563278 -0.390280295  0.007389247  9.924109610
# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, labels = c("low", "med_low","med_high", "high") , include.lowest = TRUE)
table (crime)
## crime
##      low  med_low med_high     high 
##      127      126      126      127
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)
# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)
# number of rows in the Boston dataset 
n <- nrow(boston_scaled)
# choose randomly 80% of the rows
ind <- sample(n,  size = n * 0.8)
# create train set
train <- boston_scaled[ind,]
# create test set 
test <- boston_scaled[-ind,]

After scaling, all variable means are (approximately) 0. The crime variable is now divided into 4 categories according to how high the per capita crime rate by town is.
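For reference, scale() standardizes each column by subtracting the column mean and dividing by the column standard deviation:

\[ x_{scaled} = \frac{x - \bar{x}}{s_x} \]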

Question 5: Linear discriminant analysis

# linear discriminant analysis with target variable crime
lda_crime <- lda(crime ~ ., data = train)
lda_crime
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2524752 0.2524752 0.2648515 0.2301980 
## 
## Group means:
##                   zn      indus        chas        nox          rm
## low       0.94367915 -0.8743925 -0.07933396 -0.8481811  0.45191393
## med_low  -0.04144408 -0.2967197 -0.04073494 -0.5690497 -0.12635821
## med_high -0.39747871  0.2385026  0.24280554  0.4048443  0.06129447
## high     -0.48724019  1.0173143 -0.01832260  1.0875725 -0.47173855
##                 age        dis        rad        tax     ptratio
## low      -0.8227059  0.8030999 -0.7003842 -0.7646649 -0.40468539
## med_low  -0.3695131  0.4099667 -0.5450034 -0.4914953 -0.03516108
## med_high  0.3892143 -0.3661050 -0.3958310 -0.2793902 -0.21947886
## high      0.8331414 -0.8814280  1.6361396  1.5126335  0.77846144
##                black       lstat         medv
## low       0.38555299 -0.76156148  0.514818653
## med_low   0.34964370 -0.15673296  0.005387198
## med_high  0.08222495  0.05046795  0.131583194
## high     -0.84496090  0.93654207 -0.705751591
## 
## Coefficients of linear discriminants:
##                 LD1          LD2         LD3
## zn       0.08027726  0.850195033 -0.99674368
## indus   -0.03377658 -0.199193904  0.33899353
## chas    -0.05211307 -0.068540109  0.04832574
## nox      0.39141097 -0.740893366 -1.31687351
## rm      -0.09807733 -0.076009634 -0.17520562
## age      0.25266283 -0.207417886 -0.18666131
## dis     -0.07603848 -0.417743039  0.33383870
## rad      2.87547960  1.080145890 -0.02452709
## tax      0.07399023 -0.237278739  0.61520366
## ptratio  0.11454929 -0.004113137 -0.30938457
## black   -0.18407102  0.041380241  0.19451881
## lstat    0.23076384 -0.340591164  0.37001370
## medv     0.19698359 -0.524864945 -0.16449676
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9424 0.0427 0.0149
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

classes <- as.numeric(train$crime)
# plot lda with 2 discriminants
plot(lda_crime, dimen = 2, col = classes, pch = classes)
lda.arrows(lda_crime, myscale = 1)

# singular values of the linear discriminants:
lda_crime$svd
## [1] 38.523633  8.196286  4.848112

The purpose of linear discriminant analysis (LDA) is to find the linear combinations of the original variables (in the Boston dataset) that give the best possible separation between the groups.

We separated crime into 4 groups: “low”, “med_low”, “med_high”, “high”, so the number of groups is G = 4 and the number of variables is p = 13. The maximum number of useful discriminant functions is the minimum of G - 1 and p, so the result contains 3 discriminant functions (LD1, LD2, LD3 in the proportion of trace).

The higher the ratio of the between- to within-group standard deviations on a linear discriminant variable, the more precisely that discriminant separates the groups. The proportion of trace of LD1 is 0.94, which means the first discriminant function captures about 94% of the between-group variance.

In the above picture we can see the linear combination of the variables that separate the crime classes.

Question 6) Predicting classes with the LDA model on the test data

# save the correct classes from test data
correct_classes <- test$crime
# remove the crime variable from test data
test <- dplyr::select(test, -crime)
# predict classes with test data
lda.pred <- predict(lda_crime, newdata = test)
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       13      11        1    0
##   med_low    2      17        5    0
##   med_high   0       3       16    0
##   high       0       0        0   34

The table above shows how the model predicts the crime classes. It classifies the “high” class perfectly, while the “low” and middle classes are misclassified more often.
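The overall accuracy on the test set could also be computed directly (a sketch, assuming correct_classes and lda.pred as above):

```r
# share of test observations classified correctly
mean(correct_classes == lda.pred$class)
```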

Question 7) Standardizing the dataset and running the k-means algorithm

Reload the Boston dataset and standardize the dataset (we did not do this in the Datacamp exercises, but you should scale the variables to get comparable distances). Calculate the distances between the observations. Run k-means algorithm on the dataset. Investigate what is the optimal number of clusters and run the algorithm again. Visualize the clusters (for example with the pairs() or ggpairs() functions, where the clusters are separated with colors) and interpret the results.

data("Boston")
boston_scaled2 <- scale(Boston)
boston_scaled2 <- as.data.frame(boston_scaled2)
dist_eu <- dist(Boston)
summary (dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.119  85.624 170.539 226.315 371.950 626.047
dist_man <- dist(Boston, method = 'manhattan')
summary (dist_man)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##    2.016  149.145  279.505  342.899  509.707 1198.265
# k-means clustering
km <-kmeans(Boston, centers = 4)
# plot the Boston dataset with clusters
pairs(Boston, col = km$cluster )

#K-means: determine the k
set.seed(123)
k_max <- 10
twcss <- sapply(1:k_max, function(k){kmeans(Boston, k)$tot.withinss})
qplot(x = 1:k_max, y = twcss, geom = 'line')

According to the qplot above, the optimal number of clusters is around 2, because the total WCSS drops radically there and changes much more slowly afterwards.

km <-kmeans(Boston, centers = 2)
#plot focus on some variables
pairs(Boston , col = km$cluster)

pairs(Boston[1:5] , col = km$cluster)

pairs(Boston[6:10] , col = km$cluster)

pairs(Boston[11:14] , col = km$cluster)
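Note that the distances and k-means above were run on the raw Boston data, although the exercise asks for scaled variables so that the distances are comparable. A sketch of the scaled version (object names are illustrative):

```r
library(MASS)

# reload and standardize the data, then cluster on comparable scales
data("Boston")
boston_scaled2 <- as.data.frame(scale(Boston))
dist_eu_scaled <- dist(boston_scaled2)
km_scaled <- kmeans(boston_scaled2, centers = 2)
pairs(boston_scaled2, col = km_scaled$cluster)
```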

Super-Bonus:

model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404  13
dim(lda_crime$scaling)
## [1] 13  3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda_crime$scaling
matrix_product <- as.data.frame(matrix_product)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers')
## This version of Shiny is designed to work with 'htmlwidgets' >= 1.5.
##     Please upgrade via install.packages('htmlwidgets').





library(ggplot2)
library(tidyr)
library(stringr)
library(dplyr)
library(corrplot)
library(GGally)
library(psych)

Q1) Graphical overview and summary variables in the data

#read data
human <- read.table("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/human2.txt", sep = ",", header = T)
#look at the first rows
head (human)
##               Edu2.FM   Labo.FM Edu.Exp Life.Exp   GNI Mat.Mor Ado.Birth
## Norway      1.0072389 0.8908297    17.5     81.6 64992       4       7.8
## Australia   0.9968288 0.8189415    20.2     82.4 42261       6      12.1
## Switzerland 0.9834369 0.8251001    15.8     83.0 56431       6       1.9
## Denmark     0.9886128 0.8840361    18.7     80.2 44025       5       5.1
## Netherlands 0.9690608 0.8286119    17.9     81.6 45435       6       6.2
## Germany     0.9927835 0.8072289    16.5     80.9 43919       7       3.8
##             Parli.F
## Norway         39.6
## Australia      30.5
## Switzerland    28.5
## Denmark        38.0
## Netherlands    36.9
## Germany        36.9
names (human)
## [1] "Edu2.FM"   "Labo.FM"   "Edu.Exp"   "Life.Exp"  "GNI"       "Mat.Mor"  
## [7] "Ado.Birth" "Parli.F"
str (human)
## 'data.frame':    155 obs. of  8 variables:
##  $ Edu2.FM  : num  1.007 0.997 0.983 0.989 0.969 ...
##  $ Labo.FM  : num  0.891 0.819 0.825 0.884 0.829 ...
##  $ Edu.Exp  : num  17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
##  $ Life.Exp : num  81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
##  $ GNI      : int  64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
##  $ Mat.Mor  : int  4 6 6 5 6 7 9 28 11 8 ...
##  $ Ado.Birth: num  7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
##  $ Parli.F  : num  39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
summary (human)
##     Edu2.FM          Labo.FM          Edu.Exp         Life.Exp    
##  Min.   :0.1717   Min.   :0.1857   Min.   : 5.40   Min.   :49.00  
##  1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:11.25   1st Qu.:66.30  
##  Median :0.9375   Median :0.7535   Median :13.50   Median :74.20  
##  Mean   :0.8529   Mean   :0.7074   Mean   :13.18   Mean   :71.65  
##  3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:15.20   3rd Qu.:77.25  
##  Max.   :1.4967   Max.   :1.0380   Max.   :20.20   Max.   :83.50  
##       GNI            Mat.Mor         Ado.Birth         Parli.F     
##  Min.   :   581   Min.   :   1.0   Min.   :  0.60   Min.   : 0.00  
##  1st Qu.:  4198   1st Qu.:  11.5   1st Qu.: 12.65   1st Qu.:12.40  
##  Median : 12040   Median :  49.0   Median : 33.60   Median :19.30  
##  Mean   : 17628   Mean   : 149.1   Mean   : 47.16   Mean   :20.91  
##  3rd Qu.: 24512   3rd Qu.: 190.0   3rd Qu.: 71.95   3rd Qu.:27.95  
##  Max.   :123124   Max.   :1100.0   Max.   :204.80   Max.   :57.50
human$GNI <- str_replace(human$GNI, pattern = ",", replacement = "") %>% as.numeric()
str(human$GNI)
##  num [1:155] 64992 42261 56431 44025 45435 ...
#plot the pairwise correlations between the variables in the data
fig1 <- pairs(human, pch = 21, bg = c("red", "green3", "blue", "purple", "violet", "yellow", "orange", "grey"))

fig2 <- cor(human) %>% corrplot.mixed()

* Variable explanations: the dataset has 155 observations and 8 variables.

Edu2.FM: ratio of females to males with at least secondary education

Labo.FM: ratio of females to males participating in the labour force

Edu.Exp: expected years of education

Life.Exp: life expectancy at birth

GNI: gross national income per capita

Mat.Mor: maternal mortality ratio

Ado.Birth: adolescent birth rate

Parli.F: percentage of female representatives in parliament

  • From fig1 and fig2, we can see the correlations between the variables.

  • There seem to be strong positive correlations between Edu.Exp and Life.Exp, and between Mat.Mor and Ado.Birth.

  • There is a strong negative correlation between Mat.Mor and Life.Exp.
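The pairs just named can also be pulled out numerically instead of read off the plots by eye. A minimal sketch on synthetic data (not the course data; the variable names here only mimic the real ones):

```r
# Synthetic example: find variable pairs with strong correlations
set.seed(1)
n <- 100
edu  <- rnorm(n)                    # stand-in for Edu.Exp
life <- edu + rnorm(n, sd = 0.3)    # positively related, like Life.Exp
mort <- -life + rnorm(n, sd = 0.3)  # negatively related, like Mat.Mor
cm <- cor(cbind(edu, life, mort))

# Pairs with |r| above a threshold, excluding the diagonal and duplicates:
strong <- which(abs(cm) > 0.7 & upper.tri(cm), arr.ind = TRUE)
strong
```

Running `which()` with `arr.ind = TRUE` on the upper triangle returns each strongly correlated pair once, which is handy when the correlation matrix is larger than this toy one.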

Q2) Perform principal component analysis (PCA) on the non-standardized human data

pca_human.notstand <- prcomp(human)
summary1 <- summary(pca_human.notstand)
print(summary1)
## Importance of components:
##                              PC1      PC2   PC3   PC4   PC5   PC6    PC7
## Standard deviation     1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912
## Proportion of Variance 9.999e-01   0.0001  0.00  0.00 0.000 0.000 0.0000
## Cumulative Proportion  9.999e-01   1.0000  1.00  1.00 1.000 1.000 1.0000
##                           PC8
## Standard deviation     0.1591
## Proportion of Variance 0.0000
## Cumulative Proportion  1.0000
# rounded percentages of variance captured by each PC
pca_pr <- round(100*summary1$importance[2, ], digits = 1)
# create object pc_lab to be used as axis labels
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")
biplot(pca_human.notstand, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = pc_lab[1], ylab = pc_lab[2])
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

In the unstandardized PCA biplot, the arrows visualize how the original variables relate to the principal components. The components cannot be meaningfully separated: PC1 alone captures essentially all of the variance (99.99 %), because GNI is on a much larger scale than the other variables and dominates the unscaled analysis.
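This scale effect is easy to reproduce on synthetic data (a hedged sketch, not the course data): one variable with a much larger variance swallows PC1 unless the data are standardized first.

```r
# Two independent variables on very different scales
set.seed(2)
small <- rnorm(100)              # scale ~1, like the ratio variables
big   <- rnorm(100, sd = 10000)  # scale ~10^4, like GNI

p_raw <- prcomp(cbind(small, big))                 # no scaling
p_std <- prcomp(cbind(small, big), scale. = TRUE)  # standardized

# Proportion of total variance captured by PC1 in each case:
prop_raw <- p_raw$sdev[1]^2 / sum(p_raw$sdev^2)  # ~1.0: 'big' dominates
prop_std <- p_std$sdev[1]^2 / sum(p_std$sdev^2)  # ~0.5: variance shared
```

The same contrast explains why the course data needs `scale()` before `prcomp()` in the next step.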

Q3) Perform principal component analysis (PCA) on the standardized human data

human_std <- scale(human)
pca_human_std <- prcomp(human_std)
summary2 <- summary(pca_human_std)
print(summary2)
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6
## Standard deviation     2.0708 1.1397 0.87505 0.77886 0.66196 0.53631
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595
## Cumulative Proportion  0.5361 0.6984 0.79413 0.86996 0.92473 0.96069
##                            PC7     PC8
## Standard deviation     0.45900 0.32224
## Proportion of Variance 0.02634 0.01298
## Cumulative Proportion  0.98702 1.00000
pca_pr <- round(100*summary2$importance[2, ], digits = 1)
pca_pr
##  PC1  PC2  PC3  PC4  PC5  PC6  PC7  PC8 
## 53.6 16.2  9.6  7.6  5.5  3.6  2.6  1.3
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")
biplot(pca_human_std, cex = c(0.8, 1), col = c("orange", "green4"), xlab = pc_lab[1], ylab = pc_lab[2])

Q4) Give personal interpretations of the first two principal component dimensions based on the biplot drawn after PCA on the standardized human data.

In the standardized model the first principal component (PC1) explains 53.6 % of the variance, the second (PC2) explains 16.2 %, PC3 explains 9.6 %, and so on. After standardization no single variable dominates, so the variance is spread across several components.
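These percentages come straight from the PC standard deviations: the variance of each component divided by the total variance. A self-contained sketch on the built-in iris data (not the course data) shows the calculation:

```r
# Percentage of variance per PC, computed by hand from prcomp()
pca <- prcomp(iris[, 1:4], scale. = TRUE)
pct <- round(100 * pca$sdev^2 / sum(pca$sdev^2), 1)
pct  # matches the "Proportion of Variance" row of summary(pca)
```

This is exactly what `summary2$importance[2, ]` gives above, just made explicit.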

Q5) Tea dataset analysis

Install the FactoMineR package first: install.packages("FactoMineR")

library(ggplot2)
library(tidyr)
library(stringr)
library(dplyr)
library(corrplot)
library(GGally)
library(psych)
library(FactoMineR)
data(tea)
#structure of data tea
str(tea)
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
##  $ frequency       : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
#dimensions of the data
dim(tea)
## [1] 300  36

# visualize selected columns

keep_columns <- c("Tea", "How", "how", "sugar", "where", "lunch")
tea_time <- dplyr::select(tea, one_of(keep_columns))
summary(tea_time)
##         Tea         How                      how           sugar    
##  black    : 74   alone:195   tea bag           :170   No.sugar:155  
##  Earl Grey:193   lemon: 33   tea bag+unpackaged: 94   sugar   :145  
##  green    : 33   milk : 63   unpackaged        : 36                 
##                  other:  9                                          
##                   where           lunch    
##  chain store         :192   lunch    : 44  
##  chain store+tea shop: 78   Not.lunch:256  
##  tea shop            : 30                  
## 
str(tea_time)
## 'data.frame':    300 obs. of  6 variables:
##  $ Tea  : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How  : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ how  : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ sugar: Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ where: Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ lunch: Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
gather(tea_time) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
## Warning: attributes are not identical across measure variables;
## they will be dropped

After this restructuring the data has 300 observations of the 6 selected variables. The visualisation shows how the participants drink their tea. For example, most of them do not drink tea at lunch, and Earl Grey is the most common type. The participants mostly drink their tea plain, without lemon or milk; the only addition that almost half of them like is sugar.
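The shares behind these statements can be read off frequency tables rather than the bar plots. A hedged sketch, assuming the `tea_time` data frame from the chunk above is still in the workspace:

```r
# Proportions of each answer, per variable (assumes tea_time exists)
prop.table(table(tea_time$lunch))  # drinking tea at lunch vs. not
prop.table(table(tea_time$Tea))    # Earl Grey has the largest share
prop.table(table(tea_time$sugar))  # sugar vs. No.sugar, roughly 50/50
```

`prop.table()` simply divides each count from `table()` by the total, so the numbers agree with the `summary(tea_time)` output above (e.g. 256/300 for Not.lunch).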

MCA on tea data

mca <- MCA(tea_time, graph = FALSE)
summary(mca)
## 
## Call:
## MCA(X = tea_time, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6
## Variance               0.279   0.261   0.219   0.189   0.177   0.156
## % of var.             15.238  14.232  11.964  10.333   9.667   8.519
## Cumulative % of var.  15.238  29.471  41.435  51.768  61.434  69.953
##                        Dim.7   Dim.8   Dim.9  Dim.10  Dim.11
## Variance               0.144   0.141   0.117   0.087   0.062
## % of var.              7.841   7.705   6.392   4.724   3.385
## Cumulative % of var.  77.794  85.500  91.891  96.615 100.000
## 
## Individuals (the 10 first)
##                       Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## 1                  | -0.298  0.106  0.086 | -0.328  0.137  0.105 | -0.327
## 2                  | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695
## 3                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 4                  | -0.530  0.335  0.460 | -0.318  0.129  0.166 |  0.211
## 5                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 6                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 7                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 8                  | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695
## 9                  |  0.143  0.024  0.012 |  0.871  0.969  0.435 | -0.067
## 10                 |  0.476  0.271  0.140 |  0.687  0.604  0.291 | -0.650
##                       ctr   cos2  
## 1                   0.163  0.104 |
## 2                   0.735  0.314 |
## 3                   0.062  0.069 |
## 4                   0.068  0.073 |
## 5                   0.062  0.069 |
## 6                   0.062  0.069 |
## 7                   0.062  0.069 |
## 8                   0.735  0.314 |
## 9                   0.007  0.003 |
## 10                  0.643  0.261 |
## 
## Categories (the 10 first)
##                        Dim.1     ctr    cos2  v.test     Dim.2     ctr
## black              |   0.473   3.288   0.073   4.677 |   0.094   0.139
## Earl Grey          |  -0.264   2.680   0.126  -6.137 |   0.123   0.626
## green              |   0.486   1.547   0.029   2.952 |  -0.933   6.111
## alone              |  -0.018   0.012   0.001  -0.418 |  -0.262   2.841
## lemon              |   0.669   2.938   0.055   4.068 |   0.531   1.979
## milk               |  -0.337   1.420   0.030  -3.002 |   0.272   0.990
## other              |   0.288   0.148   0.003   0.876 |   1.820   6.347
## tea bag            |  -0.608  12.499   0.483 -12.023 |  -0.351   4.459
## tea bag+unpackaged |   0.350   2.289   0.056   4.088 |   1.024  20.968
## unpackaged         |   1.958  27.432   0.523  12.499 |  -1.015   7.898
##                       cos2  v.test     Dim.3     ctr    cos2  v.test  
## black                0.003   0.929 |  -1.081  21.888   0.382 -10.692 |
## Earl Grey            0.027   2.867 |   0.433   9.160   0.338  10.053 |
## green                0.107  -5.669 |  -0.108   0.098   0.001  -0.659 |
## alone                0.127  -6.164 |  -0.113   0.627   0.024  -2.655 |
## lemon                0.035   3.226 |   1.329  14.771   0.218   8.081 |
## milk                 0.020   2.422 |   0.013   0.003   0.000   0.116 |
## other                0.102   5.534 |  -2.524  14.526   0.197  -7.676 |
## tea bag              0.161  -6.941 |  -0.065   0.183   0.006  -1.287 |
## tea bag+unpackaged   0.478  11.956 |   0.019   0.009   0.000   0.226 |
## unpackaged           0.141  -6.482 |   0.257   0.602   0.009   1.640 |
## 
## Categorical variables (eta2)
##                      Dim.1 Dim.2 Dim.3  
## Tea                | 0.126 0.108 0.410 |
## How                | 0.076 0.190 0.394 |
## how                | 0.708 0.522 0.010 |
## sugar              | 0.065 0.001 0.336 |
## where              | 0.702 0.681 0.055 |
## lunch              | 0.000 0.064 0.111 |
plot(mca, invisible=c("ind"), habillage = "quali")

From the MCA visualisation we can see how the categories spread across the first two dimensions. For example, participants who buy their tea from tea shops tend to drink unpackaged green tea. We can also see that milk goes more often with Earl Grey than with black or green tea.
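The coordinates behind the factor map can also be inspected directly instead of being read off the plot. A sketch, assuming the `mca` object fitted above; these are standard slots of a FactoMineR MCA result:

```r
# Inspect the MCA result directly (assumes the mca object from above)
head(mca$eig)            # variance and % of variance per dimension
head(mca$var$coord)      # category coordinates on the MCA dimensions
round(mca$var$eta2, 3)   # squared correlations of each variable with the dimensions
```

`mca$var$eta2` reproduces the "Categorical variables (eta2)" block of `summary(mca)`, which is useful for judging which variables drive each dimension.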




***